Fix: Make the janitor best effort by themisvaltinos · Pull Request #5802 · SQLMesh/sqlmesh

themisvaltinos · 2026-05-21T10:44:54Z

This updates the janitor instead of stopping on the first drop failure, to attempt every cleanup operation and aggregates all failures into a single error raised at the end. The warn_on_delete_failure is also respected so in that case it is set to true it would warn rather than raise.

It also introduces the force delete option. If the janitor fails to drop a view, schema or snapshot table it retains the corresponding state records so the next run can retry. This is correct for transient failures, but if the failure is permanent the janitor will retry forever and the affected environments and snapshots will never be cleaned up.

In that case, with the --force-delete flag:

sqlmesh janitor --force-delete

purges the state records for expired environments and snapshots even when their physical objects could not be dropped. But any objects that were not successfully dropped becomes orphaned and must be removed manually from the data warehouse.

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>

georgesittas · 2026-05-21T13:59:21Z

+            num_expired_snapshots += len(batch.expired_snapshot_ids)
+        except Exception as e:
+            message = f"Failed to clean up an expired snapshots batch: {e}"
+            logger.warning(message)


crericha · 2026-05-21T14:37:05Z

-        self.state_sync.delete_expired_environments(current_ts=current_ts)
+        # we want to retry on the next janitor pass if drops failed
+        if not failures:
+            self.state_sync.delete_expired_environments(current_ts=current_ts)


In the case where a schema was permanently deleted by the user, prior to the janitor running, won't this cause the environment to never expire? I feel like we need an escape hatch.

so this pr unblocks the janitor so it doesn't accumulate stale snapshots in the case of one failure and it can continue with the rest of the operations

this guard to not delete the expired environments in this case of a failure is matching the current behaviour we have of retrying the next time that the janitor runs rather than deleting the expired environments and leaving orphans in the db. the test: test_janitor_cleanup_order is an example of this behaviour that we currently want

I see the value of an escape hatch for the cases where drops persistently fail, but this would require a more deliberate mechanism. I can't think of a simple way to do it as part of this pr without changing the janitor design substantially unless you have a suggestion of a simple way

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>

Copilot

Pull request overview

Updates SQLMesh’s janitor workflow to be “best effort”: it attempts all cleanup operations, collects failures, and surfaces them together at the end (with optional warning-only behavior and a --force-delete mode to purge state even when physical drops fail).

Changes:

Make environment view/schema/catalog cleanup return a list of failures instead of raising immediately.
Make expired snapshot deletion best-effort, optionally deleting snapshot state even when physical cleanup fails (force_delete).
Aggregate and surface janitor failures in Context._run_janitor, add CLI plumbing and expand test coverage.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/core/test_snapshot_evaluator.py	Updates expectations for snapshot cleanup failures now surfaced as `SQLMeshError`.
tests/core/test_janitor.py	Adjusts janitor unit tests for best-effort cleanup and failure collection.
tests/core/integration/test_aux_commands.py	Adds/updates integration tests for aggregated failures, warn-only mode, and `force_delete`.
sqlmesh/core/snapshot/evaluator.py	Switches snapshot cleanup to collect concurrent failures and raise a combined error.
sqlmesh/core/janitor.py	Returns failure lists for cleanup helpers; adds `force_delete` support for snapshot state deletion on failure.
sqlmesh/core/context.py	Aggregates janitor failures and applies warn/raise behavior; adds `force_delete` parameter plumbing.
sqlmesh/cli/main.py	Adds `--force-delete` option and passes it through to `Context.run_janitor`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            failure_string = "\n  - ".join(failures)
+            summary = f"Janitor completed with failures:\n  {failure_string}"


+        if errors:
+            errored_snapshots = "\n".join(f"  {e.node.name}: {e.__cause__}" for e in errors)
+            raise SQLMeshError(f"\n{errored_snapshots}")


            message = f"Failed to drop the expired environment view '{expired_view}': {e}"
-            if warn_on_delete_failure:
-                logger.warning(message)
-            else:
-                raise SQLMeshError(message) from e
+            logger.warning(message)
+            failures.append(message)


        except Exception as e:
            message = f"Failed to drop the expired environment schema '{schema}': {e}"
-            if warn_on_delete_failure:
-                logger.warning(message)
-            else:
-                raise SQLMeshError(message) from e
+            logger.warning(message)
+            failures.append(message)


            except Exception as e:
                message = f"Failed to drop the expired environment catalog '{catalog}': {e}"
-                if warn_on_delete_failure:
-                    logger.warning(message)
-                else:
-                    raise SQLMeshError(message) from e
+                logger.warning(message)
+                failures.append(message)


Fix: Make the janitor best effort

96bf873

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>

themisvaltinos requested review from izeigerman and vchan May 21, 2026 11:23

georgesittas reviewed May 21, 2026

View reviewed changes

tobymao approved these changes May 21, 2026

View reviewed changes

crericha reviewed May 21, 2026

View reviewed changes

themisvaltinos added 2 commits May 21, 2026 22:59

add force-delete flag to delete from state even on failures

cc82ab0

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>

adapt test

c6f8eb4

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>

vchan approved these changes May 21, 2026

View reviewed changes

vchan requested a review from Copilot May 21, 2026 22:00

Copilot started reviewing on behalf of vchan May 21, 2026 22:00 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

themisvaltinos merged commit c5dbce5 into main May 21, 2026
32 checks passed

themisvaltinos deleted the themis/janitorcleanup branch May 21, 2026 22:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Make the janitor best effort#5802

Fix: Make the janitor best effort#5802
themisvaltinos merged 3 commits into
mainfrom
themis/janitorcleanup

themisvaltinos commented May 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

georgesittas May 21, 2026

Uh oh!

crericha May 21, 2026

Uh oh!

themisvaltinos May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

		failure_string = "\n - ".join(failures)
		summary = f"Janitor completed with failures:\n {failure_string}"

Conversation

themisvaltinos commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

georgesittas May 21, 2026

Choose a reason for hiding this comment

Uh oh!

crericha May 21, 2026

Choose a reason for hiding this comment

Uh oh!

themisvaltinos May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

themisvaltinos commented May 21, 2026 •

edited

Loading